COMPUTER PERFORMANCE PREDICTION OF A DATA FLOW ARCHITECTURE

David John Hogen

NAVAL POSTGRADUATE SCHOOL

Monterey, California

THESIS

COMPUTER PERFORMANCE PREDICTION OF A DATA FLOW ARCHITECTURE

by

David John Hogen

June 1981

Thesis Advisor:

L. A. COX, Jr

Approved for public release; distribution unlimited


REPORT DOCUMENTATION PAGE

Title: Computer Performance Prediction of a Data Flow Architecture
Author: David John Hogen
Type of Report and Period Covered: Master's Thesis; June 1981
Performing Organization: Naval Postgraduate School, Monterey, California 93940
Controlling Office: Naval Postgraduate School, Monterey, California 93940
Report Date: June 1981
Number of Pages: 91
Security Classification: Unclassified
Distribution Statement: Approved for public release; distribution unlimited
Key Words: Data Flow Computer Performance Evaluation; Data Flow Computer Performance Prediction

Abstract: Supercomputers capable of performing extremely high speed computation have been proposed which are based on an architecture known as data flow. Application of a Petri net-based methodology is used to evaluate the performance attainable by such an architecture. The architecture evaluated is MIT's cell block data flow architecture, which is being developed to execute the applicative programming language VAL. Results show that for the data flow architecture to achieve its goal of high speed computation, intelligent multiprogramming schemes need to be developed. One such scheme, based on the notion of a "concurrency vector", is introduced.

(DD Form 1473, 1 Jan 73)

Approved for public release; distribution unlimited

Computer Performance Prediction of a Data Flow Architecture

by

David John Hogen
Lieutenant Commander, United States Navy
B.S.O.E., United States Naval Academy, 1972

Submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

from the

NAVAL POSTGRADUATE SCHOOL

June 1981

ABSTRACT

Supercomputers capable of performing extremely high speed computation have been proposed which are based on an architecture known as data flow. Application of a Petri net-based methodology is used to evaluate the performance attainable by such an architecture. The architecture evaluated is MIT's cell block data flow architecture, which is being developed to execute the applicative programming language VAL.

Results show that for the data flow architecture to achieve its goal of high speed computation, intelligent multiprogramming schemes need to be developed. One such scheme, based on the notion of a "concurrency vector", is introduced.

4

TABLE OF CONTENTS

I. INTRODUCTION 6

A. BACKGROUND 6

B. RESEARCH APPROACH 7

C. ORGANIZATION 9

II. LITERATURE REVIEW 10

A. APPROACHES TO PARALLELISM 10

B. PETRI NETS 18

C. CONCEPT OF DATA FLOW 26

D. COMPUTER PERFORMANCE PREDICTION 43

E. REQUESTER-SERVER METHODOLOGY 45

III. HYPOTHESIS 53

IV. METHOD 56

A. EXPERIMENTAL DESIGN 56

B. DATA FLOW HARDWARE DEFINITION 58

C. DATA FLOW SOFTWARE DEFINITION 63

D. PROCEDURE/IMPLEMENTATION

V. RESULTS AND DISCUSSION 74

VI. SUMMARY 84

VII. RECOMMENDATIONS FOR FURTHER INVESTIGATION 86

BIBLIOGRAPHY 87

INITIAL DISTRIBUTION LIST 91

I. INTRODUCTION

A. BACKGROUND

Despite the orders-of-magnitude increase in computation speed that has occurred since the early 1950's, the need still exists today for faster computers. This need is most critical in the area of scientific computing, where there exist computations requiring on the order of a million floating point operations per second [DENNIS, 1980].

One approach to achieving higher computation speed is to increase the speed of the basic logic devices of the computer. This approach, effective in the past, faces significant obstacles to future gains because of the speed-of-light limitation on signal propagation and limitations in the integrated circuit manufacturing process.

A second approach to achieving higher computation speed is through the exploitation of parallelism which is (or can be) inherent in algorithms used to solve a wide range of scientific problems. Such parallelism can be present at both the operation and procedure levels in a program. Thus far, such exploitation of parallelism has not reached a limiting threshold to faster computation.

Data flow computing has been proposed as a conceptually viable method of achieving higher computational speed through greater exploitation of inherent algorithmic parallelism. A computer based on the data flow concept executes an instruction when its operands become available. No sequential control flow notion exists. Data flow programs are free of sequencing constraints except those imposed by the flow of operands between instructions. Thus, a data flow computer contrasts fundamentally with the "von Neumann" model. Even so, the data flow concept is capable of incorporating into one system all the known forms of parallelism exploitation, including vectorization, pipelining, and multiprocessing.

B. RESEARCH APPROACH

It was the purpose of this research to gain insights into the degree of parallelism exploitation obtainable with the data flow-based high speed computation method. The classical issues of hardware utilization, program execution time, and "degree" of multiprogramming were investigated in the context of data flow. Application of an existing Petri net-based methodology was the technique used to gain these insights.

The hypothesis for this research had two parts. First, the suitability of the Petri net-based Requester-Server methodology for prediction of the performance of data flow machines in an efficient, accurate manner was to be explored. Second, a challenge to the data flow concept was made. It was hypothesized that the goal of achieving higher speed computation through data flow computing is unattainable without achieving a high and "intelligent" degree of multiprogramming. By "intelligent" it is meant that the mapping of processes onto the hardware shall have to be done in a near optimal fashion, defined in terms of hardware utilization and program execution time.

To explore the two-part hypothesis, sets of Petri net models of data flow programs, characterized by a range of inherent parallelism, were "executed" on Petri net models of the Dennis-Misunas data flow hardware design [DENNIS, AUG 1974], using the methodology called Requester-Server [COX, 1978]. The hardware models were varied in the number of processing elements available for use in "executing" the sets of program models. Thus software models were "run" on hardware models, and appropriate performance indices were measured and analyzed.

This research is important because it suggests a method for mapping data flow programs onto the data flow machine to achieve the desired degree of high speed computation. Additionally, the Requester-Server (R-S) methodology has been shown to be an effective tool for predicting the performance of data flow computer architectures.

Grateful acknowledgment is made to L. A. Cox, designer and initial implementer of the R-S methodology, and to D. M. Stowers, who modified the R-S software to enable it to run on the PDP-11/50 minicomputer at NPS.

8

C. ORGANIZATION

The results reported here are organized in a fashion conducive to the communication of experimental computer science endeavors. Following a review of the applicable literature (Section II), the hypothesis (Section III) is presented. Next, the method used to test the hypothesis is presented in detail (Section IV). This section discusses the experimental design, which includes the identification of independent and dependent variables, characterizes and explains the Petri net definition of the data flow hardware and software, and ends with an account of the procedure used to implement the experiment to test the hypothesis. Results of the experiment and a discussion thereof are covered in Section V. This section, in addition to demonstrating the suitability of the R-S technique and exploring the multiprogramming response of data flow architectures, presents some unexpected findings. Section VI summarizes the entire research effort, including the results. Finally, Section VII presents recommendations for further investigation in the area of data flow research.

II. LITERATURE REVIEW

A. APPROACHES TO PARALLELISM

In general, computer science literature approaches the concept of parallelism exploitation from either an architecture (hardware) or language (software) point of view. In this thesis, "parallelism" shall be viewed as existing at many hierarchical levels within algorithms. Any of several different computer architectures may be capable of exploiting the parallelism which exists at one or more of these various hierarchical levels. As should be expected, each architecture is best suited to exploiting inherent algorithmic parallelism at a particular hierarchical level, but not at others. In contrast, implementation of the data flow concept proposes to exploit inherent algorithmic parallelism at all hierarchical levels, in an efficient fashion. Before presenting the concept of data flow, a review of the range of architectures and strategies currently used to exploit parallelism is presented.

In an early study of high speed computer architectures [FLYNN, 1966] a four-element taxonomy was developed which classified computer systems in terms of the amount of parallelism in their instruction streams and data streams (see Figure II.A.1). (An instruction stream is the series of operations used by the processor; a data stream is the series of operands used by the processor.) The first element of this taxonomy, depicted in Figure II.A.1(a), is the serial computer which executes one instruction at a time, affecting at most one data item at a time. Such a serial machine is denoted as a single-instruction single-data-stream (SISD) computer. The SISD computer can be characterized as possessing no capability for exploitation of algorithmic parallelism. The three remaining computer system organizations within the Flynn taxonomy do possess capabilities for exploiting algorithmic parallelism.

By allowing more than one data stream, a single-instruction multiple-data-stream (SIMD) computer results, as shown in Figure II.A.1(b). This organization allows vectorization and is known as a vector or array processor because each instruction operates on a data vector during each instruction cycle, rather than on just one operand. The model in Figure II.A.1(b) shows N processors, each accepting as input its own data stream. It is noteworthy that each of the N processors is not a standalone serial machine (SISD computer) because the N processors take the same instruction from an external control unit at each time step.

If the SISD computer is extended to permit more than one instruction stream, the multiple-instruction single-data-stream (MISD) computer shown in Figure II.A.1(c) results. This computer system organization within Flynn's taxonomy is present for completeness but has yet to be shown to possess much utility. An example of such a machine would be one built to generate tables of functions (such as squares and square roots) of a stream of numbers. Each processor would perform a different function on the same data item at each time step.

Figure II.A.1: COMPUTER MODELS [STONE, 1980]
(a) Model of an SISD computer; (b) Model of an SIMD computer; (c) Model of an MISD computer; (d) Model of an MIMD computer.

The fourth and final element of Flynn's taxonomy is one which possesses parallelism in both the instruction and data streams. This multiple-instruction multiple-data-stream (MIMD) computer (shown in Figure II.A.1(d)) is made up of N complete SISD machines which are interconnected for communication purposes. Such a parallel architecture is more readily recognized as a multiprocessor, in which as many as N processors can be performing useful work at the same time.

Beyond Flynn's taxonomy are other approaches to parallelism. The first, pipelining, is a strategy which makes use of the fact that a processor, in executing an instruction, actually performs a sequence of functions in various functional units of the processor. Each function is performed at a different stage along the pipeline. Figure II.A.2 shows a processor with a simple pipeline design. Rather than waiting for each instruction to be completely executed before beginning the next instruction, the pipeline processor begins execution of the next instruction as soon as functional units at the beginning of the pipeline are available. Thus, the pipeline is normally full, containing more than one instruction in various stages of execution.
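The speedup available from pipelining can be illustrated with a small calculation. The following sketch is offered for illustration only and is not part of the original text; it assumes an idealized pipeline in which every stage takes one time unit and a new instruction may enter the pipeline at every time unit.

def pipelined_cycles(num_instructions, num_stages):
    # On an ideal pipeline the first instruction needs num_stages cycles;
    # each subsequent instruction completes one cycle after its predecessor.
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)

def serial_cycles(num_instructions, num_stages):
    # Without pipelining, each instruction occupies all stages in turn
    # before the next instruction may begin.
    return num_instructions * num_stages

# Example: 100 instructions on a 5-stage pipeline.
print(pipelined_cycles(100, 5))   # 104 time units
print(serial_cycles(100, 5))      # 500 time units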

13

Figure II.A.2: PIPELINING

14

The final approach to parallelism to be presented is the strategy of overlapping. In the traditional sense, overlapping within a computer system occurs when the central processing unit (CPU) is allowed to function concurrently with input/output (I/O) operations. Such concurrency was prevented in early computers because I/O operations required data paths to memory which ran through CPU registers, preventing CPU functions from occurring while performing I/O. Overlapping can occur in other ways within a computer, but the example given is sufficient to convey the general idea.

The techniques for exploiting algorithmic parallelism that have been presented are not all mutually exclusive. For example, the strategies of pipelining and overlapping can be included in any of the four architectures. Furthermore, other more complex machine organizations have been proposed. One example is the multiple SIMD (MSIMD) machine, which consists of more than one control unit sharing a pool of processors through a switching network [HWANG, 1979]. Such hybrids will not be considered further.

Having described the major architectural approaches to exploiting algorithmic parallelism, it is appropriate to characterize the problems for which each method is suitable and to present some of the difficulties that still exist in using each method. By "suitable" it is meant that the method allows the processing of a problem in such a manner that some speedup in execution time is achieved in comparison with what the execution time would be for the problem run on a serial machine.

The main implementation of the SIMD architecture, the array processor, is suitable for computations which can be described by vector instructions. Also, operands processed simultaneously must be capable of being fetched simultaneously from memory. Finally, processor interconnections must support high speed data routing between processors. If any of the above conditions are not met, then the computation may execute in a predominantly serial fashion within this SIMD computer. Because of these required conditions, the array processor is generally considered to be a specialized, rather than general purpose, machine.

As previously mentioned, the MISD architecture exists merely to complete the Flynn taxonomy and will not be discussed further [STONE, 1980].

The dominant MIMD computer is the multiprocessor. The multiprocessor is considered to be a general purpose machine. Accordingly, many problems should be well-suited for execution on such an architecture. Despite the fact that such systems have been shown to work well in a number of applications, especially those which consist of a number of concurrently-processable subproblems with minimal data sharing, numerous questions have yet to be answered. These questions include how to best organize the parallel computations (i.e., partition the problem) to optimize the use of the cooperating processors, how to synchronize the processors in the system, and how to best share the data among system processors. Also, problems which possess an iterative structure can run efficiently on an array processor and avoid the overhead of synchronization and scheduling required of the multiprocessor [STONE, 1980].

Although not considered "architectures" in the sense of Flynn's taxonomy, both pipelining and overlapping (of which there exist different types) are general purpose strategies that can be applied to most problems. Like the multiprocessor, these techniques also permit the partitioning of a problem so that several operating hardware pieces can function concurrently. Accordingly, pipelining and overlapping are often considered to be forms of multiprocessing. The difference lies in the fact that pipelining and overlapping perform partitioning at different hierarchical levels of a problem than does the multiprocessing technique.

Armed with an understanding of the diverse architectural approaches to parallelism exploitation that have been used to date, it is logical to proceed with an alternative approach, that of data flow. Before doing so, however, Petri nets will be introduced. This is appropriate because the concepts of data flow computation are a direct application of Petri net theory.

B. PETRI NETS

Petri net theory plays an important part in this research endeavor for two reasons. First, as has been mentioned, Petri net theory forms the basis for the concepts used to describe and define data flow computation. Second, Petri net theory is the basis for the Requester-Server methodology that is used in this thesis research as a computer performance prediction tool. Because of its applicability, Petri net theory shall be presented herein, in an informal manner, with emphasis placed on its use in modelling parallel computation. Those desiring a more formal and complete discussion of Petri nets are referred to [PETERSON, 1977].

Petri nets may be thought of as formal, abstract models of information flow. Their main use has been in the modelling of systems of events in which some events may occur concurrently but there exist constraints on the frequency, precedence and concurrence of these events. A Petri net graph models the static structure of a system. The dynamic properties of a system can be represented by "executing" the Petri net in response to the flow of information (or occurrence of events) in the system.

18

The static graph of a Petri net is made up of two types of nodes: circles (called places) which represent conditions, and bars (called transitions) which represent events. These nodes are connected by directed arcs running from either places to transitions or transitions to places. The source of a directed arc is the input, and the terminal node is the output. The position of information in a net is represented by markers called tokens.

The dynamic execution of a Petri net is controlled by the position and movement of the tokens. A token moves as a result of a transition firing. In order for a transition to fire, it must be enabled. A transition is enabled when all of the places which are inputs to the transition are marked with a token. Upon transition firing, a token is removed from each of the input places and a token is placed on each of the output places of the transition. Thus, in modelling the dynamic behavior of a system, the occurrence of an event is represented by the firing of the corresponding transition.

Figures II.B.1 through II.B.4 show a Petri net at progressive stages of execution. As can be observed, the status of the execution at a given time can be described by the distribution of the tokens in the net. This distribution of tokens in a Petri net is called the net marking and uniquely defines the state of the net for any given instant.

19

Petri nets are uninterpreted models. Thus, some significance must be attached to token movement to indicate the intent of the model. This is usually done by labelling the nodes of a net to correspond in some way to the system being modelled. However, it should be remembered that the labelling of the nodes of a Petri net in no way affects its execution. A second attribute of Petri nets is their ability to model a system hierarchically. An entire net may be replaced by a single node (place or transition) for modelling at a greater level of abstraction or, conversely, a single node may be replaced by a subnet to show greater detail in the model.

Petri nets, as a formal graph model, are especially useful in modelling the flow of information and control in systems which can be characterized by asynchronous and concurrent behavior. Figure II.B.5 shows the initial marking of a Petri net model of such a system. Initially, transition E1 is enabled because each of its input places, C1 and C2, is marked with a token. Firing transition E1 removes one token each from places C1 and C2, and puts a token into each output place, C3 and C4. At this point in the net execution, transition E3 is disabled because one of its input places, C5, still has no token. Transition E2, however, is enabled, and upon firing causes a token to be removed from place C3 and one deposited in place C5. As an aside, this portion of the model could correspond to a system sequencing constraint, that of event E3 having to wait until event E2 completes. Upon firing transition E3, places C6 and C7 become marked with tokens as places C4 and C5 lose a token each. Transitions E4 and E5 are now enabled and can fire simultaneously, the occurrence of which corresponds to concurrent events in a modelled system. Doing so, that is, firing transitions E4 and E5, brings the Petri net model back to its original (initial) configuration.
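The token game just described can be stated concretely in a few lines of code. The sketch below is illustrative only and is not drawn from the thesis; it encodes the net of Figure II.B.5 (places C1 through C7, transitions E1 through E5) and repeatedly fires every enabled transition, printing the marking after each step.

transitions = {
    "E1": (["C1", "C2"], ["C3", "C4"]),
    "E2": (["C3"], ["C5"]),
    "E3": (["C4", "C5"], ["C6", "C7"]),
    "E4": (["C6"], ["C1"]),
    "E5": (["C7"], ["C2"]),
}

# Initial marking: places C1 and C2 each hold one token.
marking = {place: 0 for place in ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]}
marking["C1"] = 1
marking["C2"] = 1

def enabled(t):
    inputs, _ = transitions[t]
    return all(marking[p] >= 1 for p in inputs)

def fire(t):
    inputs, outputs = transitions[t]
    for p in inputs:          # remove one token from each input place
        marking[p] -= 1
    for p in outputs:         # deposit one token in each output place
        marking[p] += 1

for step in range(4):
    ready = [t for t in transitions if enabled(t)]
    for t in ready:           # E4 and E5 fire together, modelling concurrency
        if enabled(t):
            fire(t)
    print("time", step + 1, ready, {p: n for p, n in marking.items() if n > 0})

Running this reproduces the cycle described above: E1, then E2, then E3, then E4 and E5 together, after which the net returns to its initial marking.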

One other situation that can be represented using Petri nets is that of conflict. Figure II.B.6 shows a net model of such a situation. Simply, transitions E1 and E2 are both enabled. However, if either transition fires, the remaining transition becomes disabled. In such a case, it is an arbitrary decision as to which one fires. Because we would like to be able to duplicate experiments and obtain the same results, a scheme that is often used involves simply assigning priorities to transitions which are subject to conflict in a net. In this way reproducible results can be ensured. If true nondeterminism is desired in such a model, a scheme in which probabilities are associated with each transition can effectively model nondeterminism in the system under study.

Thus, to properly model a system with Petri nets, every sequence of events in the modelled system should be possible in the Petri net and every sequence of events in the Petri net should represent a possible sequence in the modelled system.

Figure II.B.1: MARKED PETRI NET, TIME=0
Figure II.B.2: MARKED PETRI NET, TIME=1
Figure II.B.3: MARKED PETRI NET, TIME=2
Figure II.B.4: MARKED PETRI NET, TIME=3
Figure II.B.5: PETRI NET GRAPH MODELLING ASYNCHRONOUS, CONCURRENT BEHAVIOR
Figure II.B.6: PETRI NET GRAPH MODELLING CONFLICT

This section has introduced Petri nets and demonstrated their usefulness in formally modelling information and control flow in systems characterized by asynchronous and concurrent behavior. Readers interested in the use of Petri nets for performance evaluation of such systems are referred to [RAMAMOORTHY, 1980] and [RAMCHANDANI, 1974]. The following section shall introduce readers to the concept of data flow which, as mentioned previously, is a direct application of Petri net theory.

C. CONCEPT OF DATA FLOW

Data flow computing is a method of multiprocessing which proposes to exploit inherent algorithmic parallelism at all hierarchical levels within a program. Additional objectives include effectively using the capabilities of LSI technology and simplifying the programming task. The concept of computation under data flow was derived by Dennis [DENNIS, 1974] (and a number of others working independently), predominantly from Karp and Miller's [KARP, 1966] work on computation graphs. This section begins by presenting the data flow concept from the perspective of language, rather than that of architecture. This approach is appropriate in view of the fact that data flow computer systems are being designed as hardware interpreters for a base language that is fundamentally different from conventional languages. A hardware description of the Dennis-Misunas data flow architecture design completes this section.

In a data flow computer, an instruction is executed as soon as its operands become available. No notion of separate control flow exists because the data dependencies define the flow of control in a data flow program. In fact, data flow computers have no need for a program location counter.

This contrasts with the traditional "von Neumann" computer architecture model, which uses a global memory whose state is altered by the sequential execution of instructions. Such a model is limited by a "bottleneck" between the computer control unit and the global memory [BACKUS, 1978]. This "feature" allows conventional languages to have side-effects, a common example of which is the ability of a procedure to modify variables in the calling program. Such side-effects are prohibited under the data flow concept. Furthermore, in data flow, no variables exist, nor are there any scope or substitution rules. In fact, the data flow concept prohibits the modification of anything that has a value. Rather, data flow computing takes inputs (operands) and generates outputs (results) that have not previously been defined. Thus, instructions in data flow are pure functions. This is necessary so that instruction execution can be based solely on the availability of data (operands). Thus the data dependencies must be equivalent to, and in fact define, the sequencing constraints in a program. Also, to exploit parallelism at all levels, it must be possible to derive these data dependencies from the high level language program instructions [ACKERMAN, 1979].

A language which allows processing by means of operators applied to values is called an applicative language. VAL (Value-oriented Algorithmic Language) is a high level data flow applicative language under development at MIT [ACKERMAN and DENNIS, 1979]. It prevents any side-effects by requiring programmers to write expressions and functions; statements and subroutines are not allowed in the language. Because of this constraint, most concurrency is apparent in a high level language program written in VAL. For the purposes of this research, no further understanding of the high level language of data flow is required. Information about high level language alternatives is available in [McGRAW, 1980], [ACKERMAN and DENNIS, 1979], and [ACKERMAN, 1979].

At what would correspond to the assembly language level, a data flow computation can be represented as a graph. The nodes of the graph correspond to operators and the arcs represent data paths. An arc into a node represents an input operand path; an arc leaving a node corresponds to a result path. Data flow graph execution occurs as operands become available at each node. When the input arcs of a node each have a value on them, the node can execute by removing those values, computing the operation, and placing the results on the output arcs (see Figure II.C.1). The example data flow graph in Figure II.C.2 computes the mean and standard deviation of its three input parameters.

Figure II.C.1: DATA FLOW NODE EXECUTION
(Z = X + Y; the node is shown enabled and after executing.)

function Stats (X, Y, Z: real returns real, real)
let
    Mean: real := (X + Y + Z) / 3;
    SD: real := SQRT( (X^2 + Y^2 + Z^2) / 3 - Mean^2 );
in
    Mean, SD
endlet
endfun

Figure II.C.2: A SIMPLE STATISTICS FUNCTION AND ITS DATA FLOW GRAPH [McGRAW, 1980]

Such a graph notation is useful in illustrating the various levels of parallelism in a program. For example, a graph node may represent a simple operator such as addition, or the entire statistics function of Figure II.C.2. Thus the data flow graph notation can represent parallelism existing at the operator, function and even computation level. The graphs execute asynchronously, nodes firing when data inputs are available. Thus no synchronization problem exists with regard to accessing shared data. Each data flow path can be marked with a value by only one operator node. Once a value is on a path, no operator can modify that value. The value can only be read when used as an input to another node [McGRAW, 1980].

Again, the data flow graph notation merely allows a logical representation of a program at a level corresponding to conventional assembly language. This logical representation shall now be extended to permit the reader to understand the basic data flow hardware instruction execution mechanism. A simple example computation that shall be used to facilitate reader understanding is the following:

Z = (X + Y) * (X - Y)

31

Figure II.C.3: A SIMPLE DATA FLOW PROGRAM GRAPH

Figure II.C.4: AN ACTIVITY TEMPLATE FOR THE ADDITION OPERATOR
(Fields: the opcode ADD, two receivers, and a destination.)

32

Figure II.C.5: PROGRAM GRAPH USING ACTIVITY TEMPLATES (ADD, MUL, SUB) FOR THE DATA FLOW PROGRAM GRAPH OF FIGURE II.C.3 [DENNIS, 1980]

33

The graph representation of this computation is shown in Figure II.C.3.

In the extended graph representation scheme, a data flow program exists as a collection of activity templates, each template corresponding to a node in the data flow program graph. For example, Figure II.C.4 shows an activity template for the addition operator. There are four fields in the activity template. The first field denotes the operation code, which specifies the operation to be performed. The second and third fields are receivers, which are locations waiting to receive operand values. The fourth field is a destination field, which specifies where the result of the operation on the operands is to go. There can be multiple destination fields. Figure II.C.5 shows the program graph representation of Figure II.C.3, using activity templates.

Activity templates have been developed which control the routing of data for such program structures as conditionals and iterations. These templates are mentioned to point out the fact that graph nodes can represent not only simple operators but can also represent more elegant and necessary constructs.

Some definitions which are necessary to the understanding of the data flow instruction execution mechanism follow. First, a data flow program instruction is the fixed portion of an activity template and is made up of the opcode and the destinations.

instruction:
<opcode, destinations>

Each destination field provides the address of some activity template and an input (or offset) denoting which receiver of the template is the target.

destination:
<address, input>

Data flow program execution occurs as follows. The fields of a template which has been activated (by the arrival of an operand value at each receiver) form an

operation packet:
<opcode, operands, destinations>

When the operation packet has been operated upon, a result packet of the form

result packet:
<value, destination>

is generated for each destination field of the original activity template. Result packet generation triggers the placement of the value in the receiver designated by its destination field. Thus, at a logical level, data flow program execution occurs as a consequence of operation packet and result packet movements through a machine described in detail below.

The basic data flow instruction execution mechanism is shown in Figure II.C.6. The data flow program, consisting of a collection of activity templates, is held in the activity store (see Figure II.C.6). Each activity template is uniquely addressable within the activity store. When an instruction is ready to be executed (i.e., the template is enabled), this address is entered in the instruction queue unit (established as a FIFO buffer).

The fetch unit is then responsible for: removing, one at a time, instruction addresses from the instruction queue; fetching the corresponding activity template; forming an operation packet based on the field values in the template; and submitting the operation packet to an operation unit for processing. The operation unit processes the operation packet by: performing the operation specified by the opcode on the operands; forming result packets (one for each destination field of the operation packet); and transmitting the result packets to the update unit. The update unit fills in the receivers of activity templates (designated by the destination fields in the result packets) with the appropriate values. The update unit is also responsible for checking the target template to see if it has all receivers filled, thus enabling the template. If so, the address of the enabled template is added at the end of the instruction queue by the update unit.
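The fetch/operate/update cycle just described can be mimicked in software. The following sketch is illustrative only and is not taken from the thesis or from any actual data flow hardware; the field names and the dictionary representation of activity templates are assumptions made for the example, and an extra "out" template is added merely to collect the final result. It executes the activity templates for the example program Z = (X + Y) * (X - Y) of Figure II.C.5.

from collections import deque

# Activity store: each template holds an opcode, two receivers, and a list
# of destinations (target template address, receiver offset).
store = {
    "add": {"op": "+",  "recv": [None, None], "dest": [("mul", 0)]},
    "sub": {"op": "-",  "recv": [None, None], "dest": [("mul", 1)]},
    "mul": {"op": "*",  "recv": [None, None], "dest": [("out", 0)]},
    "out": {"op": "id", "recv": [None],       "dest": []},
}
instruction_queue = deque()
apply_op = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
            "*": lambda a, b: a * b, "id": lambda a: a}

def update(value, destination):
    # Update unit: place the result value into the designated receiver and
    # enqueue the template's address once all of its receivers are filled.
    address, offset = destination
    template = store[address]
    template["recv"][offset] = value
    if all(v is not None for v in template["recv"]):
        instruction_queue.append(address)

# Initial result packets deliver the program inputs X and Y.
X, Y = 7, 3
for value, destination in [(X, ("add", 0)), (Y, ("add", 1)),
                           (X, ("sub", 0)), (Y, ("sub", 1))]:
    update(value, destination)

while instruction_queue:
    address = instruction_queue.popleft()                  # fetch unit
    template = store[address]
    result = apply_op[template["op"]](*template["recv"])   # operation unit
    for destination in template["dest"]:                   # one result packet per destination
        update(result, destination)

print(store["out"]["recv"][0])                             # (7 + 3) * (7 - 3) = 40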

At this point it is appropriate to discuss how and where program parallelism can be exploited by this hardware.

"...once the fetch unit has sent an operation packet off to the operation unit, it may immediately read another entry from the instruction queue without waiting for the instruction previously fetched to be completely processed. Thus a continuous stream of operation packets may flow from the fetch unit to the operation unit so long as the instruction queue is not empty.

"This mechanism is aptly called a circular pipeline: activity controlled by the flow of information packets traverses the ring of units leftwise. A number of packets may be flowing simultaneously in different parts of the ring on behalf of different instructions in concurrent execution. Thus the ring operates as a pipeline system with all of its units actively processing packets at once. The degree of concurrency possible is limited by the number of units on the ring and the degree of pipelining within each unit. Additional concurrency may be exploited by splitting any unit in the ring into several units which can be allocated to concurrent activities." [DENNIS, NOV 1980]

The Dennis-Misunas data flow architecture for implementing the described instruction execution mechanism is called the cell block architecture and is illustrated in Figure II.C.7.

"The heart of this arcnitecture is a lar^e set of instruction cells, each of which holds one activity template of a data flow program. Result packets arrive at instruction cells from the distribution network. Each instruction ceil sends an operation packet to the arbitration network when all operands and signals have been received. The function of the operation section is to execute instructions and to forward result packets to target instructions by way of the distribution network." [DENNIS, NOV19B0]

37

Figure II.C.8 reflects a practical form of the cell block architecture which makes use of LSI technology and reduces the number of devices and interconnections. This practical form is obtainable by grouping the instruction cells of Figure II.C.7 into blocks, each of which is a single device. In this organization, several cell blocks are serviced by a group of multifunction processing elements. The arbitration network channels operation packets from cell blocks to processing elements. Rather than employing a set of processing elements each capable of a different function, which is one design option, use of one multipurpose processing element type is the favored approach. Such an approach precludes the need for the arbitration network to route operation packets according to opcode. Instead, it simply has to forward operation packets to any available processing element. It is this design which forms the basis for the system model used in this research effort.

How does the basic mechanism relate to the cell block architecture? Figure II.C.9 shows a cell block implementation. It differs from the basic mechanism in two ways. First, the cell block has no processing element(s) (operation unit(s)). Second, result packets targeted for activity templates held in the same cell block must traverse the distribution network before being handled by the update unit [DENNIS, NOV 1980]. This is the Dennis-Misunas data flow architecture design. Other designs do exist; for examples, see [GOSTELOW, 1980] and [WATSON, 1979].

Figure II.C.6: DATA FLOW INSTRUCTION EXECUTION MECHANISM [DENNIS, 1980]
(Update unit, instruction queue, fetch unit, operation unit(s), and activity store connected in a ring; result packets enter at the update unit and operation packets leave the fetch unit.)

Figure II.C.7: DENNIS-MISUNAS CELL BLOCK ARCHITECTURE [DENNIS, 1980]

Figure II.C.8: PRACTICAL FORM OF THE CELL BLOCK ARCHITECTURE [DENNIS, 1980]
(Multifunction processing elements connected to cell blocks through an arbitration network and a distribution network.)

Figure II.C.9: A SIMPLE CELL BLOCK IMPLEMENTATION [DENNIS, 1980]
(Update unit, activity store, and fetch unit; result packets in, operation packets out.)

D. COMPUTER PERFORMANCE PREDICTION

Computer performance prediction is an evaluation process which proposes to estimate the performance of a system not yet in existence (i.e., in some state of design). "Performance" simply means how well a system works. This in turn connotes the concept of value. So, the purpose of estimating the performance of a system under design is to determine that system's expected value.

In order to quantify how well a system works or shall work, performance metrics called indices are used. Typical indices and their definitions are:

THROUGHPUT RATE - The volume of information processed by a system in one unit of time.

HARDWARE UTILIZATION - The ratio between the time the hardware is used during an interval of time, and the duration of that interval of time.

RESPONSE TIME - The elapsed time between the submission of a program job to a system and completion of the corresponding job output.
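As a simple illustration of how such indices might be computed from measurements, the short sketch below is included; it is not part of the thesis, and the example figures are invented for illustration only.

def throughput_rate(items_processed, interval):
    # Volume of information processed per unit of time.
    return items_processed / interval

def hardware_utilization(busy_time, interval):
    # Fraction of the observation interval during which the hardware was in use.
    return busy_time / interval

def response_time(submission_time, completion_time):
    # Elapsed time between submission of a job and completion of its output.
    return completion_time - submission_time

# Example: 6 instructions executed during a 10-cycle run on one processing
# element that was busy for 6 of those cycles; the job finished at cycle 9.
print(throughput_rate(6, 10))        # 0.6 instructions per cycle
print(hardware_utilization(6, 10))   # 0.6
print(response_time(0, 9))           # 9 cycles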

Computer performance prediction can be achieved via several different techniques. Each technique has limitations and advantages. The technique utilized in this thesis is that of simulation. The simulation technique involves the representation, by a model, of certain aspects of the behavior of a system in the time domain. Observing these aspects of the behavior in time of the system's model, under inputs generated by a model of the system's inputs, produces results useful in the evaluation of the modelled system [FERRARI, 1978]. For the purposes of this research, the aspects of behavior that are of interest are the performance indices previously defined.

Of significant importance to any simulation effort are the issues of validation and parameter estimation. Conceptually, validation attempts to establish some degree of confidence that the simulation shall produce results which shall closely correspond with the performance of the system under scrutiny. Parameter estimation provides the simulation effort with hopefully credible parameter values needed to perform a simulation having relevant results. These issues shall be addressed in section IV.A: Experimental Design.

The last section of the review of the literature applicable to this research endeavor presents the Requester-Server methodology. The Requester-Server methodology is the "tool" used to perform the simulation which generates the results on which the prediction of data flow performance is based.

Readers desiring a more thorough presentation of the subject of computer performance prediction are referred to [FERRARI, 1979], [COX, 1978], [ALLEN, 1980], [HAMMING, 1975], [SPRAGINS, 1980], [BUZEN, 1980], and [SAUER, 1980].

44

E. REQUESTER-SERVER METHODOLOGY

The Requester-Server (R-S) methodology was designed and initially implemented by L. A. Cox, Jr. [COX, 1978]. Subsequently, the Requester-Server software was modified by D. M. Stowers [STOWERS, 1979] to run on the PDP-11/50 minicomputer at NPS. This section summarizes those portions of [COX, 1978] and [STOWERS, 1979] which are applicable to and necessary for the understanding of this research.

The R-S methodology is capable of predicting the performance of computer systems characterized by asynchronous, concurrent behavior. The methodology can predict performance at both the computer system and computer job levels. The R-S methodology allows the user to separately specify the hardware configuration(s) to be evaluated, the software (programs) to be used in evaluating the hardware configuration(s), and the mechanism or policy for allocating hardware resources to program requests for service. The methodology makes provision for variable levels of detail (in a hierarchical sense) in both the hardware and software. Finally, the R-S methodology is capable of simulating concurrency in both the hardware and software. Thus, for a given hardware configuration, the control structure mandated by the software can be mapped onto the hardware and system performance analyzed and predicted.

The simulation process is begun by representing the software (programs) and hardware as two separate Petri net graphs. In the Petri net graph of the software, each arc can be thought of as having an associated propagation delay, the extent of which is dependent upon the hardware configuration used to execute the program. If these delays are definable by their correlation to the Petri net model of the hardware, then performance values for the indices of section II.D (Computer Performance Prediction) can be obtained by executing the Petri net model of the software on the Petri net hardware configuration(s). The R-S "tool" serves as the interface between the Petri net model of system software and the Petri net model of system hardware. This interface permits the hardware and software Petri net graphs to be constructed separately. This is important because the control structure and sequencing constraints of both hardware and software can be maintained separately. This permits a direct and meaningful representation of both the system software and hardware being modelled.

The source file which serves as the input to the R-S program is organized into three sections. The software section of the input file consists of a description of the Petri net graph representing the software program(s) to be executed. This net graph description is formulated in terms of the functions and constraints of the services required of the hardware. The hardware section of the input file is made up of a description of the computer system components and their interconnections. This description can be made (hierarchically) at a bit level or major component level, depending on the system aspects under scrutiny. The Petri net graph upon which the hardware description is based is constructed in terms of its operation in time. The last section of the input file, called the dynamic section, provides the user of the R-S "tool" a place to denote system initial conditions by defining the hardware and software nets' token markings at the beginning of a "run". As may be recalled from section II.B (Petri Nets), both the software and hardware sections merely define static Petri net structures. Performance prediction follows from the attachment of significance to the structures and restrictions on token movement within these structures.

The dynamic nature of Petri nets is exploited by this R-S methodology as follows. The software net representation makes a series of requests for the services of the hardware net representation. Repeatedly, the R-S process maps these requests for service onto the hardware net representation. At each "invocation" the R-S process "runs" the hardware net to provide the service requested by the software net. Upon completion of each of the service requests, the R-S process "runs" the software net representation until the hardware is again needed. This cycle repeats itself until the software net representation has been completely "run" and its terminal state reached.

47

Events in the hardware net graph correspond to operations in time. A collection of events is used to represent each functional unit. Token movement through the hardware net graph corresponds to the flow of data and control through the modelled hardware system. A simple hardware net description is provided in Figure II.E.1. Events in the software net graph correspond to requests for service. As an example, an event could equate to a request for a floating point multiplication. The flow of tokens in the software net graph equates to the logical flow of the algorithm, constrained by its implicit data dependencies or sequencing constraints. A simple software net description is provided in Figure II.E.2.

Together, the software and hardware net graphs can be executed in such a way as to simulate the operation of the computer system for the given software workload. The interaction of the two net graphs is orchestrated by the R-S token arbiter. Network simulation begins with the marking of the "begin" node of the software net graph. This net graph is then executed as would be any Petri net graph. The arrival of a token at any place in the software net graph indicates a request for service, at which time the R-S token arbiter takes control. (The type of service requested is denoted by the type of the place and is defined in the software net description.) The R-S token arbiter removes the token from the software net and then permits the software net graph to continue executing until no further moves are possible. The R-S token arbiter then initializes the hardware functional unit (net graph denoted by the type of service requested) by marking it with tokens. The hardware net graph is then executed one step. Tokens reaching events corresponding to service completion are removed, and the token of the software net which originally caused the request for service is replaced, by the R-S token arbiter. Repeating this sequence of actions results in the execution of the software net graph by the hardware net graph. A sample input file dynamic section and the results obtained from executing the software and hardware net graph descriptions of Figures II.E.1 and II.E.2 are presented in Figure II.E.3. Those readers interested in the Requester-Server methodology are referred to [COX, 1978] and [STOWERS, 1979] for a more in-depth discussion of its capabilities and usage.
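The alternation between the two nets can be illustrated with the sample run of Figure II.E.3, in which two program events, J+K and M+J, each request one hardware service and a gate marked with a single token serializes them. The sketch below is illustrative only and is not the actual R-S software of [COX, 1978]; in particular, the three-cycle service time is an assumption chosen so that the completion times match those listed in the figure.

# Requests for hardware service made by the software net, serialized by a
# gate holding one token (MARK GATE WITH 1 in the dynamic section).
requests = ["J+K", "M+J"]
SERVICE_CYCLES = 3        # assumed hardware cycles needed to serve one request

clock = 0
gate_tokens = 1
waiting = list(requests)

while waiting:
    if gate_tokens > 0:
        gate_tokens -= 1                  # arbiter grants the gate token
        event = waiting.pop(0)            # next service request from the software net
        for _ in range(SERVICE_CYCLES):   # "run" the hardware net one step at a time
            clock += 1
        print("*PROGRAM EVENT " + event + " COMPLETES(" + str(clock) + ")")
        gate_tokens += 1                  # completion returns control to the software net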

This completes the necessary review of the literature required to understand the research that follows. The fundamental concepts of the various approaches to parallelism, Petri nets, the data flow architecture, computer performance prediction, and the Requester-Server methodology have been reviewed. The next section presents the two-part hypothesis which this research addresses and tests.

49

Figure II.E.1: A SAMPLE HARDWARE NET GRAPH AND INPUT FILE DESCRIPTION

50

Figure II.E.2: A SAMPLE SOFTWARE NET GRAPH AND INPUT FILE DESCRIPTION

51

BEGIN DYNAMIC NET;
MARK GATE WITH 1;
COMMENT: GATE ENABLED TO ALLOW ONLY ONE OPERATION IN PROGRESS AT ANY TIME.
EXECUTE 10;
COMMENT: EXECUTE TEN HW CYCLES OR UNTIL PROGRAM IS COMPLETE.
END DYNAMIC NET;

S5: EXECUTE 10;
*PROGRAM EVENT J+K REQUESTS HW SVCS(1)
*PROGRAM EVENT M+J REQUESTS HW SVCS(1)
TIME = 1
TIME = 2
TIME = 3
TIME = 4
TIME = 5
TIME = 6
*PROGRAM EVENT J+K COMPLETES(3)
*PROGRAM EVENT M+J COMPLETES(6)
S6: END DYNAMIC NET;

Figure II.E.3: A SAMPLE INPUT FILE DYNAMIC SECTION AND OUTPUT FILE LISTING

52

III. HYPOTHESIS

Because there exist several data flow architecture proposals, it is desirable to have a tool with which to predict the performance of the diverse designs for comparison purposes. The first part of this research's hypothesis was that the Petri net-based Requester-Server (R-S) methodology is such a tool, capable of predicting the performance of data flow architectures in an efficient, accurate manner. In effect, the R-S tool was to be tested.

The second part of this research's hypothesis was concerned with the Dennis-Misunas data flow architecture design. This design was chosen for two reasons. First, there existed adequate information in the literature about this design on which to base an accurate model for simulation purposes. Second, the Dennis-Misunas design of the basic instruction execution mechanism is essentially the same as several other schemes in various stages of implementation [DENNIS, 1979]. The hypothetical challenge to this design was that the goal of achieving higher speed computation is not attainable unless a high and "intelligent" degree of multiprogramming is realized, as shall be explained next.

Obviously, high speed computation shall require a high hardware utilization. By this it is meant that most of the processing elements (PE's) shall have to be performing useful work most of the time. Such a high hardware utilization is attainable when either of two situations occurs. First, a high hardware utilization will result when a process possessing a large amount of inherent parallelism is being run (by itself) on the machine. In this case, a program's execution time is dependent upon its amount of inherent parallelism and the number of PE's in the machine. Second, a high hardware utilization is attainable when a multiprogramming environment (in which several processes are permitted to run simultaneously on the machine) is instituted. In such a multiprogramming environment, an individual process shall be competing for hardware resources (PE's). Thus, that process' execution time may be lengthy regardless of its amount of inherent parallelism. This is so because that process may have the use of only a small portion of the machine's resources (PE's) at any point in time. Put another way, if at any time a process has N instructions available for execution, but there are fewer than N PE's available for executing those instructions in a parallel fashion, then the process' execution time shall be lengthened over what it could be if it had sufficient PE's available.
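This lengthening can be made concrete with a simple calculation. The sketch below is illustrative and not taken from the thesis; it assumes every enabled instruction takes one time unit, describes a program by the number of instructions that could execute in parallel at each step on an unlimited machine (a parallelism profile), and estimates execution time and hardware utilization when only a fixed number of PE's is available.

import math

def run_with_p_pes(profile, num_pes):
    # profile[t] is the number of instructions that could execute in parallel
    # at step t if PE's were unlimited; each instruction takes one time unit.
    steps = sum(math.ceil(n / num_pes) for n in profile)
    work = sum(profile)                      # total instructions executed
    utilization = work / (steps * num_pes)
    return steps, utilization

profile = [1, 2, 4, 8, 4, 2, 1]              # inherent parallelism at each step

for p in (1, 2, 4, 8):
    steps, util = run_with_p_pes(profile, p)
    print(p, "PE(s):", steps, "time units, utilization", round(util, 2))

Under these assumptions, a small number of PE's keeps utilization high at the cost of a long execution time, while a large number of PE's shortens execution time but leaves many PE's idle; it is this tension that motivates the multiprogramming schemes considered in this research.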

54

In such a situation, a scheme may be needed to implement a policy which achieves two objectives:

1. maintaining high hardware utilization, and

2. providing an acceptable average response time for a user requiring a given amount of processing.

"Acceptable average response time" is construed to mean tnat tne actual response time of any particular program which requires a ^iven amount of processinff shall not be lengthened considerably over what it would be if tne program were executed by itself on the data flow macnine. Thus, it shall be desirable to minimize the affect of system load on an individual program's execution time. That the second objective should be met even at the expense of the first objective is a strong point made by [KLEINROCK, 1976J . By merely mapping processes onto the data flow machine as they arrive, it is expected that objective #1 shall be achieved but at the expense of objective #2. This situation was expected to be demonstrated by this research.

The purpose of this section has been to "frame" the research area by presenting the issues which give rise to the hypothesis. The following section presents the method used to test the hypothesis, and includes a discussion of the assumptions made to facilitate the simulation experiments where undecided design issues remain.

55

IV. METHOD

A. EXPERIMENTAL DESIGN

The experiment to allow prediction of data flow computer performance involved executing sets of Petri net models of data flow programs on Petri net models of data flow hardware. The Requester-Server (R-S) program tool monitored the data flow (model) programs' "execution" and provided data which permitted the determination of the performance indices: response time and hardware utilization. It is important to realize that the results of this research predict the performance of a model of a data flow computer, not that of an operating data flow machine itself. (Model validation and parameter estimation issues are addressed in section IV.B: Data Flow Hardware Definition.)

The reader who is familiar with analytic modelling employing queueing theory may ask why that technique, rather than the simulation technique, is not used to predict the performance of the data flow design. The answer is that the analytic approach unnecessarily constrains the prediction by requiring assumptions to be made about the software. Specifically, the Petri net models of software programs are discretely defined with regard to the amount of inherent parallelism available for exploitation at each time step in program execution. To model analytically, the variability of inherent parallelism available for exploitation must be described by probability distributions which hide the definable nature of the programs at discrete time steps.

For this experiment, the data flow architecture was to be modelled with several different quantities of processing elements (PE's). The sample data flow program models were to be characterized by varying but definable amounts of inherent parallelism available for exploitation. Each (model) program was to be separately run on each (model) hardware configuration. (Hereafter, the word "model" shall be omitted but assumed in referring to the program and hardware models used in this experiment.) Data was to be obtained to permit determination of the performance indices (response time and hardware utilization), for each run, from the monitor function of the R-S tool. After running each program separately, arbitrary program mixes were to be run on each hardware configuration and the same performance indices again determined. Finally, hand-optimized program mixes were to be run on each hardware configuration and the same performance indices determined once again. By evaluating the results, the hypothesis was expected to be either supported or refuted.

The independent variable for this experiment was defined to be the quantity of PE's available to the hardware model. Because PE's are but one resource demanded by a process in execution, other independent variable choices could have


included other resources such as: the quantity of cell blocks available, the type of distribution or arbitration network employed, and/or the type of PE's (multipurpose or sets of single-purpose functional units) utilized. Expanding the number of independent variables significantly increases the complexity of evaluating the results. How these issues were resolved is explained in section IV.B.

The dependent variables for this experiment were response time and hardware utilization. The results of the experiment were expected to provide data which could be plotted on graphs. Curves plotting the execution time of each data flow program against the number of PE's would constitute one such graph. Others, and their significance, are presented in section V: Results and Discussion.

B. DATA FLOW HARDWARE DEFINITION

The Petri net models of the data flow hardware configurations were quantified in terms of their operation in time. Such quantification required assigning time duration values to each portion of the cell block architecture model in such a way as to closely model the hardware. Doing so required several assumptions to be made. Those assumptions are addressed individually below, to help substantiate the credibility of the resultant hardware models.
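To make the notion of a time-quantified net concrete, the following sketch steps a toy timed Petri net in which every transition carries a fixed firing delay expressed in multiples of a 100 nsec tick (the same scaling used for the hardware model described below). This is only an illustration of attaching durations to net components; it is not the R-S tool, and the place names, transition structure, and delays are hypothetical.

    import heapq

    def simulate(places, transitions, horizon_ns, tick_ns=100):
        """Step a toy timed Petri net.

        places:      dict mapping place name -> token count
        transitions: list of (input places, output places, delay_ns)
        """
        pending = []          # (completion time, sequence no., output places)
        seq = 0
        for now in range(0, horizon_ns + 1, tick_ns):
            # deliver tokens from firings that complete at this tick
            while pending and pending[0][0] <= now:
                _, _, outs = heapq.heappop(pending)
                for p in outs:
                    places[p] = places.get(p, 0) + 1
            # start every transition whose input places each hold a token
            for ins, outs, delay in transitions:
                while all(places.get(p, 0) > 0 for p in ins):
                    for p in ins:
                        places[p] -= 1
                    seq += 1
                    heapq.heappush(pending, (now + delay, seq, outs))
        return places

    # Hypothetical fragment: an operation packet claims a free PE, executes,
    # and its result packet then updates a cell block.
    places = {"op_packet": 1, "pe_free": 1}
    transitions = [
        (("op_packet", "pe_free"), ("result_packet", "pe_free"), 500),
        (("result_packet",),       ("cell_block_update",),       250),
    ]
    print(simulate(places, transitions, horizon_ns=2000))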


To begin, the processing elements (PE's) were assumed to be multifunctional, each capable of executing any instruction routed to it in one "standard" instruction execution time unit. Allowing the PE's to be multifunctional and characterized by a single execution time simplifies the modelling process.

There were at least two other possibilities that could be accommodated by expansion of the Petri net models. First, each multifunction PE could be replaced by a set of single-purpose PE's, each single-purpose PE defined in terms of its particular instruction execution time and capable of executing concurrently with other PE's of the set. Second, each PE could be replaced by a subnet in which only one instruction could be executed in any given time step, but the model would define the execution time as a function of the instruction type.

The first alternative implies a more complex arbitration network with a conceivably longer routing time. The second alternative would require additional net complexity. (However, this approach would be a good candidate for subsequent research.) Because the actual implementation configuration has not been finalized, modelling the PE's as multifunctional and characterized by a single execution time was a reasonable path to follow.

The distribution network design also has not been finalized. For ease of modelling, a crossbar


switch design capable of supporting simultaneous transfers of result packets to cell blocks was chosen to be modelled. This choice permitted a standard routing time to be characterized by the model. Other network designs, especially packet routing networks, may be preferred to the crossbar switch for the ultimate machine because of their lower cost and comparable performance in a data flow architecture [DENNIS, 1979].

The choice to model the PE's as multifunction units precluded the need for anything but a simple arbitration network. Such a network would merely have to route operation packets to any available PE. Accordingly, a standard routing time for this network was characterized in the model.

With regard to the cell blocks, the assumption was made that sufficient cell blocks were available to hold all portions of all processes being run on the machine at every instant. Thus, there is no notion of paging portions of processes into and out of memory (the activity store in the case of the data flow architecture). This assumption carries with it the assumption that all program compilation (resulting in extended data flow graph-like representations) is complete before program execution begins. Other compilation strategies are under consideration, such as requiring the user to interact with the system to achieve a high degree of parallelism exploitation [McGRAW, 1980].


As has been described, other hardware choices (representing independent variables in an experiment) can be made and easily implemented by simply defining appropriate subnets which, in a time-wise fashion, characterize the portions of the hardware under scrutiny. The approach taken in this research permitted the hardware timing characteristics to be a function of simply the number of PE's. Figure IV.B.1 is the Petri net representation of the cell block architecture hardware utilized in this research. For the purposes of this experiment and in the configuration described, each PE was assumed to be driven at the rate of two million floating point operations per second (2 MFLOPS), a rate claimed to be reasonable by [DENNIS, 1980]. This figure represents an instruction execution time of 500 nanoseconds (nsec). (This is represented in the hardware model by scaling each event/transition pair to equal 100 nsec.) Associating timing characteristics with each component in the data flow architecture design yields the summary shown in figure IV.B.2.

PE (instruction execution)                             50 nsec
CELL BLOCK (memory fetch, assuming MOS technology)    250 nsec
DISTRIBUTION NETWORK (assuming crossbar switch)       250 nsec
ARBITRATION NETWORK (assuming negligible)               -
                                    [WEITZMAN, 1980]
                                            TOTAL:    550 nsec

FIGURE IV.B.2: TIMING CHARACTERISTICS OF DATA FLOW ARCHITECTURE COMPONENTS


FIGURE IV.B.1: PETRI NET REPRESENTATION OF THE CELL BLOCK ARCHITECTURE HARDWARE


For the purposes of this research, the quantity of PE's in the hardware was varied from one to sixteen, doubling the quantity each time. This resulted in data generated for the following quantities of modelled PE's: 1, 2, 4, 8, and 16.

This summarizes the assumptions utilized in developing the hardware model presented here. The following section describes the Petri net definition of the data flow software: the program models which were "executed" on the hardware models.

C. DATA FLOW SOFTWARE DEFINITION

The Petri net models of data flow programs were quantified in terms of the amount of inherent parallelism available for exploitation at each discrete time step, as well as in terms of the implicit data dependencies of the programs. (As previously mentioned, the data dependencies define the control flow of a program.) The initial approach involved taking sample programs written in the high level language (hll) VAL and converting them to their equivalent Petri net representations for subsequent "execution" on the data flow hardware models. The problem with this approach was that the compilation process is not yet developed. Thus, the hardware instructions that would be required for each hll instruction were not determinable.

The subsequent approach, which was the one utilized, involved designing Petri net program models characterized by varying but discretely definable levels of inherent parallelism.


Executing such artificial programs conceivably produced more informative results than would have been obtained with a few select programs, which might have demonstrated only data flow's suitability for those special purpose computations. The individual programs shall be characterized after introducing a new concept.

The new concept introduced at this point is that of a software "concurrency vector". A concurrency vector is a tuple, each entry of which defines the amount of inherent parallelism in a program, at the operation packet hierarchical level, at a discrete instruction execution time step. Each entry of the tuple is implicitly subscripted by the time step it describes. For example, the simple statistics function of section II.C (see figure II.C.2, page 32) would be characterized by the concurrency vector (4,2,2,2,1,1). In this example concurrency vector, the "4" represents the fact that the four operations "+", "SQ", "SQ", and "SQ" could be processed in parallel during the first time step of execution of the simple statistics function. This is so because no sequencing constraints exist among these four operations. Thus the concurrency vector defines how many operation packets could be processed in parallel if all the instructions (i.e. addition, subtraction, division, square, square root) were implemented in hardware. (If they were not, the subfunctional operation packets required by the instruction would be considered in


defining the concurrency vector entries.) It should also be recognized that the concurrency vector, though a function of a program, is dependent upon a standard instruction execution time duration. If the hardware is implemented such that execution time is a function of the instruction type, then the concurrency vector entries could be described at an even lower level than the operation packet level. Such a level would correspond to a basic hardware cycle time, where executing an instruction would require some number of hardware cycles greater than one to complete. This additional complexity need not be considered in this research in view of the hardware design approach taken, but could be accommodated by the R-S methodology used here.
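Because the concurrency vector simply counts the operations that can fire at each earliest-possible time step, it can be derived mechanically from a program's data dependency graph. The sketch below does this for a small, hypothetical dependency graph (the operation names and edges are illustrative only and are not the statistics function of figure II.C.2): each operation is placed at the step one greater than the latest of its predecessors, and the vector counts the operations at each step.

    def concurrency_vector(deps):
        """deps: dict mapping each operation to the list of operations it depends on."""
        step = {}

        def earliest(op):
            if op not in step:
                preds = deps[op]
                step[op] = 1 if not preds else 1 + max(earliest(p) for p in preds)
            return step[op]

        for op in deps:
            earliest(op)
        vector = [0] * max(step.values())
        for s in step.values():
            vector[s - 1] += 1
        return tuple(vector)

    # Hypothetical four-operation fragment: two independent squares feed an
    # addition, whose result feeds a square root.
    deps = {"sq1": [], "sq2": [], "add": ["sq1", "sq2"], "sqrt": ["add"]}
    print(concurrency_vector(deps))   # -> (2, 1, 1)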

Four programs ("A" through "D") were utilized in this research. These programs are differentiable by their length as well as by the amount of inherent parallelism available for exploitation at each time step. The Petri net representations of these programs are shown in figures IV.C.1 through IV.C.4. Additionally, the concurrency vector for each is shown. The program mixes for this experiment included one of each of the three programs "A", "B" and "D". Another program mix, including one of each of the four programs "A" through "D", was also used.

The following section presents the procedure utilized in executing the experiment. Additionally, the method of


FIGURE IV.C.1:

(MODEL) PROGRAM "A" CV: (4,3,2,1,2,3,4)


FIGURE IV.C.2: (MODEL) PROGRAM "B"

CV: (8,8,8,8)


FIGURE IV.C.3:

(MODEL) PROGRAM "C"

CV: (6,6,5,5,4,4,3,3,2,2)


FIGURE IV.C.4: (MODEL) PROGRAM "D"

CV: (3,5,7,2,1)


mapping the program mixes onto each of the hardware configurations is explained.

D. PROCEDURE/IMPLEMENTATION

The four procedural steps utilized in executing this experiment were as follows. First, Petri net models of both the hardware configurations (figure IV.B.1) and software programs (figures IV.C.1-.4) were converted to a format acceptable as input to the Requester-Server (R-S) program. Two Pascal programs, compiled and executed on the NPS "B-side" PDP-11 (a UNIX-based system), facilitated the (separate) generation of the hardware and software portions of the input files for the R-S program. Each input file was formed by concatenating the hardware and software portions and then editing the resulting file to define the dynamic execution desired. The second step was to transfer each complete input file from the NPS "B-side" to the NPS "A-side" PDP-11 (an RSX-11M-based system) via an inter-processor link. Third, the R-S program was run on the "A-side", taking as input the file which defined the hardware, software, and dynamic execution desired. Fourth and finally, data regarding the execution of the software on the hardware was obtained from the output file generated by the R-S program. The results of the data analysis are presented and discussed in section V (Results and


Discussion).


The implementation portion of this section addresses the technique used to map the software onto the hardware in such a way as to effectively simulate this function as it might be done on a real data flow machine. The first set of experimental "runs", which consisted of the separate running of each program on each hardware configuration, was straightforward in implementation. The procedure described above, in which each program file portion was concatenated with the appropriate hardware file portion, achieved a relevant mapping for modelling a single process running on a particular hardware configuration. The subsequent set of experimental "runs", in which a program mix was mapped onto each of the hardware configurations, was not so straightforward to implement, as shall be explained next.

To understand the mapping of software (i.e. processes) onto data flow hardware, it is helpful to scrutinize the functions of the operating system for such a machine. Because the scheduling and synchronization of concurrent activities are built in at the hardware level, a data flow machine's operating system will only be responsible for initialization, termination, and input/output (I/O) of processes. Once a process is mapped onto the data flow machine, it runs to completion without further intervention by the operating system (except for I/O). The question which must be answered is: when should another process be mapped onto a machine which is already executing one or more


processes? Thus, in defining the input file of a program mix representative of ready processes, the program mix had to be defined in terms of a mapping function.

The mapping functions can be thought of as operating system assignment policies. Thus, for those runs involving program mixes (as opposed to single programs), an assignment policy had to be simulated. One aspect of this research, then, can be viewed as an investigation of different policies for mapping processes onto data flow machines in a multiprogramming environment.

Each program mix consisted of the three programs "A", "B" and "D". (A later "run" for which data was gathered utilized the four-program mix consisting of one each of the programs "A" through "D".) Each mix was varied in the way in which it was mapped onto the hardware, thereby simulating different operating system mapping functions. The operating system assignment policies for mapping a program mix onto the hardware configurations follow. Three policies were simulated. First, the three programs were permitted to begin "execution" at the same time. Second, an "80% Rule" was simulated in which an additional program was permitted to begin "execution" whenever the hardware utilization dropped below 80%. Third, an "intelligent" assignment policy was implemented via a mapping function based on the programs' concurrency vectors. This assignment policy, it was envisioned, would cause optimal performance in terms of the


performance indices: response time and hardware utilization. The concurrency vector approach optimizes the assignment of processes onto the machine by fitting together concurrency vectors of ready processes in such a way that the objectives noted in section III are achieved. For example, given a machine with eight PE's, the concurrency vectors would be fitted as shown in figure IV.D.1.

JOB "A":  (4,3,2,1,2,3,4)
JOB "B":  (3,5,7,6,5,2)
JOB "C":  (4,4,4,2,1)
JOB "D":  (...,8,4,4)
JOB "E":  (2,7,8,...)
TOTAL:    (...,8,8,8,8,8,8,8,8,8,8,7,3)
          =====TIME=====>

FIGURE IV.D.1: AN EXAMPLE OF FITTING CONCURRENCY VECTORS TOGETHER

By generating the concurrency vectors at compile time, a program can declare beforehand the resources (as a function of time) needed for its execution, as well as when the program will be completed. An operating system can thus choose the sequencing of the running of the waiting processes to achieve the fit which best meets the objectives of section III.
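The fitting illustrated by figure IV.D.1 was done by hand in this research, but it can be approximated mechanically. The sketch below is one possible greedy interpretation, not the scheme actually used here: each ready process is started at the earliest time step at which its concurrency vector, added to the demand already committed, never exceeds the number of PE's. The job list uses the concurrency vectors of programs "A", "B", and "D" from figures IV.C.1, IV.C.2, and IV.C.4.

    def earliest_start(load, cv, num_pe):
        """First start step at which cv fits on top of the committed load.
        Assumes no single cv entry exceeds num_pe (otherwise the job must wait
        for a larger machine, as discussed in section V)."""
        start = 0
        while any(load.get(start + i, 0) + c > num_pe for i, c in enumerate(cv)):
            start += 1
        return start

    def fit_jobs(jobs, num_pe):
        """jobs: list of (name, concurrency vector); returns start step per job."""
        load, starts = {}, {}
        for name, cv in jobs:
            s = earliest_start(load, cv, num_pe)
            starts[name] = s
            for i, c in enumerate(cv):
                load[s + i] = load.get(s + i, 0) + c
        return starts

    jobs = [("A", (4, 3, 2, 1, 2, 3, 4)),
            ("B", (8, 8, 8, 8)),
            ("D", (3, 5, 7, 2, 1))]
    print(fit_jobs(jobs, num_pe=8))   # -> {'A': 0, 'B': 7, 'D': 1}

Under this particular greedy rule, program "D" is overlapped with "A" immediately while "B" is held back until "A" completes; the hand-optimized fits reported in figure V.2 strike a different balance, so the sketch should be read only as an illustration of concurrency vector fitting.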

The results of these experimental runs are presented in the following section in graphical form. Additionally, the meaning and significance of the results are discussed.


V. RESULTS AND DISCUSSION

In response to the first part of this research's hypothesis, it is proposed that the Petri net-based Requester-Server (R-S) methodology is indeed a desirable tool with which to predict the performance of the diverse designs of data flow architectures. The ability to separately specify the hardware, the software, and the resource allocation policy was an R-S feature which permitted efficient generation of combinations of those three items. The ability to easily implement variable levels of detail in both the hardware and software was not exploited, but the method for doing so was introduced. Finally, the R-S methodology's capability of simulating concurrency and asynchronous behavior in both the hardware and software is a necessity for accurately modelling and simulating data flow computing.

The results which address the second part of this research's hypothesis are now presented. To begin, figure V.1 shows individual program execution times as a function of the number of processing elements (PE's). These absolute execution times were used as a basis for comparison with the results from the multiprogramming environment runs. Percent hardware utilization is displayed adjacent to each data point. The hardware utilization values are averages of the


[Figure V.1 plots execution time (µs) against the quantity of PE's (1, 2, 4, 8, and 16) for each program run separately, with percent hardware utilization annotated at each data point.]

FIGURE V.1: EXECUTION TIME (µs) vs QUANTITY OF PE's


hardware utilizations at each time step during execution. The graph shows that a program's execution time can be drastically reduced by increasing the number of processing resources (PE's) available, up to the point where execution time is bounded by the amount of inherent parallelism available for exploitation in the program.
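A first-order reading of this behavior can be computed directly from a concurrency vector: if each time step costs one 0.5 µs instruction execution, and a step whose demand exceeds the machine must be split into multiple waves, the execution time on p PE's is roughly the sum over the vector entries c of ceil(c/p), times 0.5 µs. The sketch below applies this to program "A"; it happens to reproduce the separately-run times reported for that program in figure V.2, but it ignores routing and packet traffic effects (for example, program "B" on two PE's is reported as 9.0 µs rather than the 8.0 µs this estimate gives), so it is only an approximation and not the R-S model itself.

    from math import ceil

    def estimated_time_us(cv, num_pe, step_us=0.5):
        """First-order execution time estimate for one program run alone."""
        return step_us * sum(ceil(c / num_pe) for c in cv)

    cv_a = (4, 3, 2, 1, 2, 3, 4)          # program "A", figure IV.C.1
    for p in (1, 2, 4, 8, 16):
        print(p, estimated_time_us(cv_a, p))
    # -> 9.5, 5.5, 3.5, 3.5, 3.5 microseconds: the time flattens once p
    #    reaches the largest concurrency vector entry.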

The following results pertain to the running of the program mixes in a simulated multiprogramming environment. Initially, it was intended that the program mixes would contain a greater quantity of programs than were actually run. This was not achieved due to time constraints. Accordingly, the results should be considered preliminary in nature. On a positive note, the results provide insight into several data flow operation issues. Figure V.2 provides the raw data for this research, with the exception of the data utilized in computing hardware utilization. Figures V.3, V.4, and V.5 present the hardware utilization (as a function of time) for the 4-, 8-, and 16-PE configurations. Similar graphs for the 1- and 2-PE configurations are presented in figure V.6 (note their identical nature).

The implications supported by this data follow. They are not definitive because of the small quantity of programs and program mixes in the model. Thus, while the second part of the hypothesis may not have been adequately tested, the methodology for doing so appears to be available in the R-S


                 1 PE         2 PE's                     4 PE's
                 sep.    sep.   CV    80%   ABT     sep.   CV    80%   ABT
A                 9.5     5.5   9.0  12.5  17.5      3.5   4.5   6.5   9.0
B                16.0     9.0  17.5  13.0  14.5      4.0   9.0   6.0   7.5
C                20.0    10.0    -     -     -       6.0    -     -     -
D                 9.0     5.0  11.5  17.5  16.0      3.0   7.0   9.5   6.0
AVG RESPONSE TIME
OF 3 PGMS        13.6     6.2  12.7  14.3  16.0      3.5   6.8   7.3   9.2
AVG HW UTIL (%)   100       -    99    99    99        -    96    91    96

                         8 PE's                     16 PE's
                 sep.   CV    80%   ABT     sep.   CV    80%   ABT
A                 3.5   3.5   3.5   5.0      3.5   3.5   3.5   3.5
B                 2.0   4.5   3.0   4.0      2.0   2.0   2.0   2.0
C                 5.0    -     -     -       5.0    -     -     -
D                 2.5   4.0   5.5   4.5      2.5   3.0   3.0   3.0
AVG RESPONSE TIME
OF 3 PGMS         2.7   4.0   4.0   4.5      2.7   2.8   2.8   2.8
AVG HW UTIL (%)     -    96    78    86        -    62    62    52

                 16 PE's, 4-(MODEL) PROGRAM MIX
                 sep.   CV    80%   ABT
A                 3.5   3.5   3.5   3.5
B                 2.0   3.5   2.0   2.0
C                 5.0   5.0   5.5   6.0
D                 2.5   2.5   4.0   3.5
AVG 4 PGM
RESPONSE TIME    3.25   3.6  3.75  3.75
AVG HW UTIL (%)     -    68    62    57

FIGURE V.2: EXPERIMENT DATA (ALL VALUES IN µs)


[Figure V.3 shows histograms of hardware utilization (%) versus time (µs) for the three assignment policies on the 4-PE configuration.]

FIGURE V.3: HISTOGRAMS OF 3 ASSIGNMENT POLICIES (4 PE's)


[Figure V.4 shows histograms of hardware utilization (%) versus time (µs) for the three assignment policies on the 8-PE configuration; one legible panel label is "80% RULE".]

FIGURE V.4: HISTOGRAMS OF 3 ASSIGNMENT POLICIES (8 PE's)


[Figure V.5 shows histograms of hardware utilization (%) versus time (µs) for the three assignment policies on the 16-PE configuration.]

FIGURE V.5: HISTOGRAMS OF 3 ASSIGNMENT POLICIES (16 PE's)


[Figure V.6 shows histograms of hardware utilization (%) versus time (µs) for the 1- and 2-PE configurations; the legible time-axis values are 34.5 µs and 17.5 µs.]

FIGURE V.6: HISTOGRAMS OF 3 ASSIGNMENT POLICIES (1 and 2 PE CASES)


tool. Additionally, it is maintained that the initial results support the discussion which follows.

Whenever the amount of concurrency in all running processes exceeds the number of PE's available to meet the requirements of the processes, some slowdown in execution time results for some processes. The dual of this result is that, so long as there are adequate PE's available, no slowdown in any process' execution time results.

Under the "All Begin Together" (ABT) assignment policy, the data flow hardware becomes "overloaded", resulting in the slowdown just described. For example, program "A", though the first to begin processing under the ABT scheme, is the last to finish under the three-program mix. Under the four-program mix with 16 PE's, the "C" program takes longer only because of its great length and inherent parallelism.

Under the "80% Rule" assignment policy, the average hardware utilization is lower than that under the ABT policy (for the three-program mix). Also, the average of the three programs' execution times is lower than that under the ABT policy. For the four-program mix, the average hardware utilization is slightly greater. This reflects a better mapping of processes onto the machine. (Average hardware utilization is defined as the average, over the duration of a run, of the hardware utilization percentages at each time step of that run.)
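Stated as a computation, a minimal sketch, assuming the monitor reports how many PE's are busy at every time step of a run:

    def average_hw_utilization(busy_per_step, num_pe):
        """Percent of PE capacity used, averaged over every time step of a run."""
        per_step = [100.0 * busy / num_pe for busy in busy_per_step]
        return sum(per_step) / len(per_step)

    # e.g. a short 4-PE run in which 4, 4, 3, and 2 PE's were busy:
    print(average_hw_utilization([4, 4, 3, 2], num_pe=4))   # -> 81.25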


Under the optimized concurrency vector (CV) approach, programs were mapped onto the hardware configurations in such a way as to achieve a high hardware utilization at each time step as well as to minimize the average response time of the programs in the mix. The results indicate average hardware utilizations at least as high as under either of the other assignment policies. Also, the average response times were at least as low as under either of the other assignment policies. This optimized concurrency vector approach should be suitable for machine optimization. Using concurrency vectors generated at compile time, the mapping of additional processes onto the data flow machine should probably continue only so long as acceptable average response time for every process is maintained. When a process characterized by more inherent parallelism than can currently be accommodated on the data flow machine is awaiting assignment (i.e. mapping onto the data flow machine), that process' assignment should be delayed until sufficient (or, if necessary, all) PE's are available to process the computation in parallel. (A user advisory denoting such a delay would be highly desirable.)

The time spent in preprocessing jobs in accordance with any assignment scheme to achieve some level of optimization may be unnecessary or even wasteful. This trade-off will have to be examined in greater depth.


VI. SUMMARY

Following a review of the pertinent literature, a two-part hypothesis was proposed. First, the Petri net-based Requester-Server (R-S) methodology's suitability for predicting the performance of data flow machines was to be tested. Second, it was hypothesized that the goal of economically achieving higher speed computation through data flow computing would be unattainable without achieving a high and intelligent degree of multiprogramming. The R-S methodology, a simulation technique, permits the separate specification of the hardware to be evaluated, the software to be used in the hardware evaluation, and the policy for allocating hardware resources to program requests for service. Accordingly, Petri net models of data flow hardware configurations were quantified in terms of their execution in time, and Petri net models of data flow programs were quantified in terms of the amount of inherent parallelism available for exploitation at each discrete time step, as well as in terms of the implicit data dependencies of the programs. Model programs were "run" on model hardware configurations. Results obtained from the monitor function of the R-S program were analyzed with respect to the performance indices: hardware utilization and response time.


Three assignment policies for determining when to map additional programs onto a data flow machine were tested:

1. all programs begin together

2. assign an additional program whenever the hardware utilization drops below 80%

3. assign an additional program based on a concurrency vector.

Results show that the R-S methodology is indeed an efficient and easy-to-use tool for investigating data flow architectures. Also, initial results indicate that optimized scheduling based upon concurrency vectors is viable for deciding when to map additional processes onto a data flow machine to achieve the objectives of maintaining high hardware utilization and providing acceptable average response time.


VII. RECOMMENDATIONS FOR FURTHER RESEARCH

With regard to the methodology, worthwhile additions to the R-S program would be user-friendly "front-" and "back-ends" which would further simplify both the generation of input files for the R-S tool and the retrieval of desired data from the output file generated by each run.

In the area of data flow, simulations in which the hardware definition of the architecture was varied (as described in section IV.B) could provide insights regarding the optimal hardware configuration for the expected program load. In particular, the quantity of (modelled) PE's should be increased to a number closer to the amount expected in the actual machine (approximately 512). Of course, in order to model more accurately, the expected load in terms of the quantity of programs and typical amounts of inherent parallelism will have to be defined more exactly.

A final open area within data flow research is the development and testing of specific algorithms using concurrency vectors to permit machine optimization and implementation of a desirable assignment policy.


BIBLIOGRAPHY

Ackerman, W. B. and Dennis, J. B., "VAL -- A Value-Oriented Algorithmic Language: Preliminary Reference Manual," Computation Structures Group TR-218, Laboratory for Computer Science, MIT, Cambridge, MA, June, 1979.

Ackerman, W. B., "Data Flow Languages," AFIPS Conference Proceedings, v. 48, New York, June, 1979.

Agerwala, T., "Putting Petri Nets to Work," IEEE Computer, December, 1979.

Allan, S. J. and Oldehoeft, A. E., "A Flow Analysis Procedure for the Translation of High Level Languages to a Data Flow Language," Proceedings of the 1979 International Conference on Parallel Processing, 1979.

Allen, A. O., "Queueing Models of Computer Systems," IEEE Computer, April, 1980.

Arvind, K. P. and Bryant, R. E., "Parallel Computers for Partial Differential Equation Simulation," Computation Structures Group Memo 178, Laboratory for Computer Science, MIT, Cambridge, MA, June, 1979.

Backus, J., "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs," Communications of the ACM, v. 21, n. 8, August, 1978.

Batcher, K. E., "Design of a Massively Parallel Processor," IEEE Transactions on Computers, v. C-29, n. 9, September, 1980.

Brock, J. D. and Montz, L. B., "Translation and Optimization of Data Flow Programs," Proceedings of the 1979 International Conference on Parallel Processing, 1979.

Buzen, J. P. and Denning, P. J., "Measuring and Calculating Queue Length Distributions," IEEE Computer, April, 1980.

Cox, L. A., "Performance Prediction of Computer Architectures Operating on Linear Mathematical Models," Ph.D. Thesis, Computer Science Dept., University of California, Davis, Report UCRL-52582, 28 September 1978.


Cooper, R. B., Introduction to Queueing Theory, Macmillan, New York, 1972.

Davis, A. L., "The Architecture and System Method of DDM1: A Recursively Structured Data Driven Machine," Proceedings of the Fifth Annual Symposium on Computer Architecture, Computer Architecture News, v. 6, n. 7, April, 1978.

Davis, A. L., "A Data Flow Evaluation System Based on the Concept of Recursive Locality," AFIPS Conference Proceedings, v. 48, New York, June, 1979.

Dennis, J. B., "First Version of a Data Flow Procedure Language," Lecture Notes in Computer Science, v. 19, Springer-Verlag, 1974.

Dennis, J. B. and Misunas, D. P., "A Preliminary Architecture for a Basic Data-Flow Processor," Computation Structures Group Memo 102, Laboratory for Computer Science, MIT, Cambridge, MA, August, 1974.

Dennis, J. B., "The Varieties of Data Flow Computers," First International Conference on Distributed Computing Systems, Huntsville, AL, 1979.

Dennis, J. B., Boughton, G. A., and Leung, C. K., "Building Blocks for Data Flow Prototypes," Seventh Annual Symposium on Computer Architecture, ACM-SIGARCH Newsletter, v. 8, n. 3, May, 1980.

Dennis, J. B., Leung, C. K., and Misunas, D. P., "A Highly Parallel Processor Using a Data Flow Machine Language," Computation Structures Group Memo 134-2, June, 1980.

Dennis, J. B., "Data Flow Supercomputers," IEEE Computer, November, 1980.

Dhas, C. R., "Estimation of Intrinsic Parallelism in High Level Programs Using Data Flow Execution," Distributed Computing, Proceedings of the 21st IEEE Computer Society International Conference, 23 September 1980.

Ferrari, D., Computer Systems Performance Evaluation, Prentice-Hall, 1978.

Flynn, M. J., "Very High-Speed Computing Systems," Proceedings of the IEEE, v. 54, 1966.


Gostelow, K. P. and Thomas, R. E., "Performance of a Simulated Data Flow Computer," IEEE Transactions on Computers, v. C-29, n. 10, October, 1980.

Hamming, R. W., "How Do You Know the Simulation Is Relevant?," Simulation, November, 1975.

Hwang, K. and Ni, L. M., "Performance Evaluation and Resource Optimization of Multiple SIMD Computer Organizations," Proceedings of the 1979 International Conference on Parallel Processing, 1979.

Jennings, S. C. and Hartel, R. J., "Petri Net Simulations of Communications Networks," M.S. Thesis, Naval Postgraduate School, NPS52-80-003, March, 1980.

Karp, R. M. and Miller, R. E., "Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing," SIAM Journal of Applied Mathematics, v. 14, November, 1966.

Kleinrock, L., Queueing Systems, Volume II: Computer Applications, John Wiley and Sons, 1976.

Kuck, D. J. and others, "Measurements of Parallelism in Ordinary Fortran Programs," IEEE Transactions on Computers, v. C-23, January, 1974.

Leasure, B. R., "Compiling Serial Languages for Parallel Machines," M.S. Thesis, Dept. of Computer Science Report 76-805, University of Illinois, Urbana-Champaign, IL, November, 1976.

McGraw, J. R., "Data Flow Computing -- Software Development," IEEE Transactions on Computers, v. C-29, n. 12, December, 1980.

Miller, R. E., "A Comparison of Some Theoretical Models of Parallel Computation," IEEE Transactions on Computers, v. C-22, n. 8, August, 1973.

Misunas, D. P., "Petri Nets and Speed Independent Design," Communications of the ACM, v. 16, n. 8, August, 1973.

Padua, D. A., Kuck, D. J., and Lawrie, D. H., "High Speed Multiprocessors and Compilation Techniques," IEEE Transactions on Computers, v. C-29, n. 9, September, 1980.


Peterson, J. L., "Petri Nets," Computing Surveys, v. 9, n. 3, September, 1977.

Ramamoorthy, C. V. and Ho, G. S., "Performance Evaluation of Asynchronous Concurrent Systems Using Petri Nets," IEEE Transactions on Software Engineering, v. SE-6, n. 5, September, 1980.

Ramchandani, C., "Analysis of Asynchronous Concurrent Systems by Petri Nets," Project MAC TR-120, MIT, Cambridge, MA, 1974.

Ruggiero, W. and others, "Analysis of Data Flow Models Using the SARA Graph Model of Behavior," AFIPS Conference Proceedings, v. 48, New York, June, 1979.

Sauer, C. H. and Chandy, K. M., "Approximate Solution of Queueing Models," IEEE Computer, April, 1980.

Sifakis, J., "Use of Petri Nets for Performance Evaluation," Measuring, Modelling and Evaluating Computer Systems, North-Holland, 1977.

Spragins, J., "Analytical Queueing Models," IEEE Computer, April, 1980.

Stone, H. S., Introduction to Computer Architecture, 2nd Edition, Science Research Associates, 1980.

Stowers, D. M., "Computer Architecture Performance Prediction for Naval Fire Control Systems," M.S. Thesis, Naval Postgraduate School, NPS52-79-006, December, 1979.

Watson, I. and Gurd, J., "A Prototype Data Flow Computer with Token Labelling," AFIPS Conference Proceedings, New York, June, 1979.

Weitzman, C., Distributed Micro/Minicomputer Systems, Prentice-Hall, 1980.

Woodruff, J. P., "Scientific Application Coding in the Context of Data Flow," Lawrence Livermore Laboratory Report UCRL-81922, 15 December 1978.

Zuberek, W. M., "Timed Petri Nets and Preliminary Performance Evaluation," Conference Proceedings of the Seventh Annual Symposium on Computer Architecture, ACM-SIGARCH Newsletter, v. 8, n. 3, May, 1980.


INITIAL DISTRIBUTION LIST

                                                         No. Copies

1.  Defense Technical Information Center                      2
    Cameron Station
    Alexandria, Virginia 22314

2.  Library, Code 0142                                        2
    Naval Postgraduate School
    Monterey, California 93940

3.  Department Chairman, Code 52                              2
    Department of Computer Science
    Naval Postgraduate School
    Monterey, California 93940

4.  Lyle A. Cox, Code 52CL                                    4
    Department of Computer Science
    Naval Postgraduate School
    Monterey, California 93940

5.  Bruce J. MacLennan, Code 52ML                             2
    Department of Computer Science
    Naval Postgraduate School
    Monterey, California 93940

6.  Naval Sea Systems Command                                 4
    Washington, D. C. 20362
    Attn: PMS-400
    David J. Hogen, LCDR, USN

